Slavonic Corpus for Stylometry Research

Authors

  • Jan Svec
  • Jan Rygl
Abstract

Stylometry techniques such as authorship recognition, machine translation detection and pedophile identification are used daily in applications for the most widely used languages. Under-represented languages, however, lack data sources usable for stylometry research. In this paper, we propose an algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt crawling and data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts meta-information. The system was used on Czech and Slovak web domains to build a Slavonic corpus for stylometry research. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.
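
As a rough illustration of the kind of meta-information the abstract lists (author, publication time, document heading), the sketch below extracts those fields from a single HTML page with simple pattern-based heuristics. It is not ACB's actual algorithm: the names `CorpusDocument` and `extract_document`, and the assumption that pages expose `<meta name="author">` and `<meta name="date">` tags, are hypothetical and chosen only for the example. Detection of document borders is not covered.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorpusDocument:
    """One corpus entry with the meta-information named in the abstract."""
    heading: Optional[str]
    author: Optional[str]
    published: Optional[str]
    text: str


# Hypothetical heuristics: real pages mark up these fields in many other ways.
META_AUTHOR = re.compile(r'<meta\s+name="author"\s+content="([^"]+)"', re.IGNORECASE)
META_DATE = re.compile(r'<meta\s+name="date"\s+content="([^"]+)"', re.IGNORECASE)
HEADING = re.compile(r'<h1[^>]*>(.*?)</h1>', re.IGNORECASE | re.DOTALL)
BODY = re.compile(r'<body[^>]*>(.*?)</body>', re.IGNORECASE | re.DOTALL)
TAG = re.compile(r'<[^>]+>')


def first_group(pattern: re.Pattern, html: str) -> Optional[str]:
    """Return the first captured group of the pattern, or None if absent."""
    match = pattern.search(html)
    return match.group(1).strip() if match else None


def extract_document(html: str) -> CorpusDocument:
    """Build a CorpusDocument from raw HTML; missing fields become None."""
    heading = first_group(HEADING, html)
    if heading is not None:
        heading = TAG.sub("", heading).strip()  # drop inline markup in the heading
    body = BODY.search(html)
    raw = body.group(1) if body else html
    text = " ".join(TAG.sub(" ", raw).split())  # strip tags, collapse whitespace
    return CorpusDocument(
        heading=heading,
        author=first_group(META_AUTHOR, html),
        published=first_group(META_DATE, html),
        text=text,
    )
```

Fields that cannot be found are left as `None` rather than guessed, so that downstream stylometry experiments can filter out documents with incomplete meta-information.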


Similar resources

Old Sources and Modern Procedures: Computer Processing of Old-Church Slavonic

A framework for computer processing of Old-Church Slavonic including its specific features is presented. The corpus of Old-Church Slavonic and its annotation is introduced. Incorporation of manually pre-prepared card catalogues into a corpus is proposed.


CHURCH SLAVONIC AND CROATIAN HISTORICAL LEXICOGRAPHY IN THE LEXICOGRAPHIC AGE OF GOLD International conference Church Slavonic and Croatian historical lexicography

Owing to the uncomparable quantity of people working on so many different dictionaries and with almost all results being splendid – claims Richard Bailey – we live in the golden age of lexicography. The current state, achievements and perspectives of Church Slavonic as well as Croatian historical lexicography in this presumed lexicographic age of gold were presented and discussed during the co...


CLiPS Stylometry Investigation (CSI) corpus: A Dutch corpus for the detection of age, gender, personality, sentiment and deception in text

We present the CLiPS Stylometry Investigation (CSI) corpus, a new Dutch corpus containing reviews and essays written by university students. It is designed to serve multiple purposes: detection of age, gender, authorship, personality, sentiment, deception, topic and genre. Another major advantage is its planned yearly expansion with each year’s new students. The corpus currently contains about ...


Cross-Genre Author Profile Prediction Using Stylometry-Based Approach

Author profiling task aims to identify different traits of an author by analyzing his/her written text. This study presents a Stylometry-based approach for detection of author traits (gender and age) for cross-genre author profiles. In our proposed approach, we used different types of stylistic features including 7 lexical features, 16 syntactic features, 26 character-based features and 6 vocab...


TwiSty: A Multilingual Twitter Stylometry Corpus for Gender and Personality Profiling

Personality profiling is the task of detecting personality traits of authors based on writing style. Several personality typologies exist, however, the Myers-Briggs Type Indicator (MBTI) is particularly popular in the non-scientific community, and many people use it to analyse their own personality and talk about the results online. Therefore, large amounts of self-assessed data on MBTI are rea...




Publication date: 2015